YouTube API CoverImage.png

Sebastian Quirarte | sebastianquirajus@gmail.com | in/sebastianquirarte | Last Updated: Nov 29 2023 | GitHub

Objective¶

The aim of this project is to extract, transform, and visualize data from any YouTube channel and all its videos using the YouTube API using Python and PowerBI. In this case I wanted to analyze one of my favorite channels: Kurzgesagt – In a Nutshell.

Overview¶

  1. Data Extraction and Transformation: Extracting data directly from YouTube through their API and storing this data into dataframes.

  2. Data Preprocesing: Converting data types, creating new columns and cleaning dataframes.

  3. Visualization and Analysis Visualizing and analyzing the data obtained from all the videos uploaded by the channel of interest.

  4. Dashboard: Creating a dashboard in PowerBI of the data collected to display the project results in a concise manner.

Libraries Used¶

  • googleapiclient: The official Python client library for Google's discovery based APIs.
  • pandas: data manipulation, analysis, and data structures and operations for manipulating numerical tables and time series.
  • IPython: interactive command-line terminal for Python. Used in this case for JSON formatting.
  • dateutil: helps you manipulate and work with dates and time stamps.
  • isodate: implements ISO 8601 date, time and duration parsing.
  • youtubeAPI: functions written specifically for this project, available in file 'youtubeAPI.py'
  • seaborn: data visualization library based on matplotlib.
  • matplotlib: comprehensive library for creating static, animated, and interactive visualizations in Python.
  • nltk: symbolic and statistical natural language processing (NLP) for English written in Python.
  • wordcloud: word cloud generator in Python.

Results¶

Plots Using Python¶

Plots1.png Plots2.png Plots3.png

Dashboard PowerBI (LINK)¶

YoutubeAPI Dashboard.png

Proyect¶


Note: import missing libraries using pip install in the terminal if necesary.

In [1]:
# General Libraries
from googleapiclient.discovery import build
import pandas as pd
from IPython.display import JSON
from dateutil import parser
import isodate

# Functions written specifically for this project
from youtubeAPI import *

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Natural Lenguage Processing (NLP)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

To start, we generate an API key from the Google API console following the steps mentioned at: https://developers.google.com/youtube/v3/getting-started

PLEASE CHANGE THIS KEY TO YOUR OWN PERSONAL KEY IF YOU WANT TO RUN THIS CODE

Note: The quota limit for the YouTube API is 10,000 units per day

In [2]:
api_key = 'AIzaSyAnQgjZ7IL_ZV1qfsjCAnu9t7Hmjn9KE1c' # CHANGE KEY

We'll add the channel ID of the channel we want to analyze. This can be obtained directly from the YouTube channel page url or from online tools available through a simple google search.

In [3]:
channel_id = ['UCsXVk37bltHxD1rDPwtNM8Q'] # Kurzgesagt – In a Nutshell

Let's set up our YouTube API.

In [4]:
api_service_name = "youtube"
api_version = "v3"

# Get credentials and create an API client
youtube = build(
    api_service_name, api_version, developerKey=api_key)

Data Extraction and Transformation¶

We can request the data of the channels we are interested in and view the results in JSON format.

In [5]:
request = youtube.channels().list(
            part="snippet,contentDetails,statistics",
            id=channel_id
)
response = request.execute()

JSON(response)
Out[5]:
<IPython.core.display.JSON object>

After taking a look at the structure of the JSON data, we'll create a function that requests data and stores it as a dictionary inside a list and returns it as a dataframe.

The statistics we're going to extract are:

  • Channel name
  • Subscribers
  • Views
  • Total videos
  • Playlist ID (will be used to get all video IDs from the channel)

By calling our function we can extract the data for each of the YouTube channels and view them in a dataframe.

In [6]:
channel_stats = get_channel_stats(youtube, channel_id)
channel_stats
Out[6]:
Channel_Name Total_Subscribers Total_Views Total_Videos PlaylistID
0 Kurzgesagt – In a Nutshell 21500000 2527532973 198 UUsXVk37bltHxD1rDPwtNM8Q

Next, let's create a second function that loops through the channel's video playlist and returns information about all of their video's IDs.

Calling our get_video_ids function returns a list of all video IDs for our specified playlist.

In [8]:
# Gets playlistID from channel_stats dataframe
playlist_id = channel_stats.iloc[0]["PlaylistID"]
# Gets all video IDs from YouTube channel
video_ids = get_video_ids(youtube, playlist_id)

Now that we have all the video IDs from a specified channel, we can request the data from each video using the following function.

In [9]:
video_df = get_video_details(youtube, video_ids)
video_df
Out[9]:
video_id channelTitle title description tags publishedAt viewCount likeCount commentCount duration definition caption
0 fuFlMtZmvY0 Kurzgesagt – In a Nutshell The Internet is Worse Than Ever – Now What? Go to https://ground.news/nutshell to get 30% ... None 2023-11-29T15:00:43Z 2871336 204225 16523 PT11M32S hd true
1 FgnjdW-x7mQ Kurzgesagt – In a Nutshell The Last Thing To Ever Happen In The Universe Get 50% off your first month of KiwiCo at http... None 2023-11-21T15:00:01Z 3639408 198112 9440 PT11M5S hd true
2 tybKnGZRwcU Kurzgesagt – In a Nutshell How to Win an Interstellar War ✨ Join the Galactic Club: your cozy corner in ... None 2023-11-15T15:00:01Z 4489106 212122 10615 PT12M43S hd true
3 S7TUe5w6RHo Kurzgesagt – In a Nutshell 4.5 Billion Years in 1 Hour Learn more about how complex life evolved with... None 2023-11-10T16:15:06Z 5986432 430706 22717 PT1H3M31S hd true
4 Qtmrp_4BKRQ Kurzgesagt – In a Nutshell A Glimpse Into Our Future ✨ WORLDWIDE SHIPPING AVAILABLE ✨ The 12,024 Hu... None 2023-10-17T14:04:14Z 1481809 53767 1516 PT1M42S hd true
... ... ... ... ... ... ... ... ... ... ... ... ...
193 F3QpgXBtDeo Kurzgesagt – In a Nutshell How The Stock Exchange Works (For Dummies) Why are there stocks at all?\n\nEveryday in th... [Stock market, wallstreet, stock exchange, mar... 2013-11-28T17:03:32Z 8297195 130534 7270 PT3M34S hd true
194 UuGrBhK2c7U Kurzgesagt – In a Nutshell The Gulf Stream Explained Learn about the role of the sea in global warm... [global conveyor belt, climate change, global ... 2013-10-11T19:11:39Z 5938614 64868 1984 PT5M4S hd true
195 Uti2niW2BRA Kurzgesagt – In a Nutshell Fracking explained: opportunity or danger Fracking explained in five minutes.\n\nFrackin... [Fracking, controversy, oil, gas, nature, wate... 2013-09-03T09:12:24Z 7254944 101461 8073 PT5M3S hd true
196 KsF_hdjWJjo Kurzgesagt – In a Nutshell The Solar System -- our home in space An Infographic trip through the wonders of the... [Solar system, planets, earth, mars, moon, jup... 2013-08-22T13:24:56Z 6231652 83643 6167 PT7M21S hd true
197 hOfRN0KihOU Kurzgesagt – In a Nutshell How Evolution works The mechanisms of evolution explained in one v... [evolution, science, darwin, design, infograph... 2013-07-11T14:09:52Z 10605269 244000 75674 PT11M48S hd true

198 rows × 12 columns

This is our main dataframe we will visualize and analyze, but first we need to pre-process our data.

Data Pre-Processing¶

Let's take a look at the null values in our data.

In [10]:
# Check for NULL values
video_df.isnull().any()
Out[10]:
video_id        False
channelTitle    False
title           False
description     False
tags             True
publishedAt     False
viewCount       False
likeCount       False
commentCount    False
duration        False
definition      False
caption         False
dtype: bool

There are null values only in our tags column, meaning that some of this channel's videos don't have any tags. Let's check the data types of the columns of our dataframe.

In [11]:
# Check data types
video_df.dtypes
Out[11]:
video_id        object
channelTitle    object
title           object
description     object
tags            object
publishedAt     object
viewCount       object
likeCount       object
commentCount    object
duration        object
definition      object
caption         object
dtype: object

'Object' type doesn't make sense for some of our columns such as view count, let's fix that and convert the necessary columns into numeric format.

In [12]:
# Convert count columns to numeric
numeric_cols = ['viewCount', 'likeCount', 'commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors = 'coerce', axis = 1)

Now, let's determine the day of the week that each video was published and add these days as a new column.

In [13]:
# Publish day of the week
video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x)) 
video_df['pushblishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A")) 

We'll also convert the YouTube time format to seconds as a new column. For this we can use the isodate library.

In [14]:
# Convert YouTube duration to seconds
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x))
video_df['durationSecs'] = video_df['durationSecs'].apply(lambda x: x.total_seconds())

Finally, let's add a column to our dataframe that shows the total number of tags that each video has (NULL values will be shown as 0).

In [15]:
# Add tag count
video_df['tagCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))

Here's our final dataframe.

In [16]:
video_df
Out[16]:
video_id channelTitle title description tags publishedAt viewCount likeCount commentCount duration definition caption pushblishDayName durationSecs tagCount
0 fuFlMtZmvY0 Kurzgesagt – In a Nutshell The Internet is Worse Than Ever – Now What? Go to https://ground.news/nutshell to get 30% ... None 2023-11-29 15:00:43+00:00 2871336 204225 16523 PT11M32S hd true Wednesday 692.0 0
1 FgnjdW-x7mQ Kurzgesagt – In a Nutshell The Last Thing To Ever Happen In The Universe Get 50% off your first month of KiwiCo at http... None 2023-11-21 15:00:01+00:00 3639408 198112 9440 PT11M5S hd true Tuesday 665.0 0
2 tybKnGZRwcU Kurzgesagt – In a Nutshell How to Win an Interstellar War ✨ Join the Galactic Club: your cozy corner in ... None 2023-11-15 15:00:01+00:00 4489106 212122 10615 PT12M43S hd true Wednesday 763.0 0
3 S7TUe5w6RHo Kurzgesagt – In a Nutshell 4.5 Billion Years in 1 Hour Learn more about how complex life evolved with... None 2023-11-10 16:15:06+00:00 5986432 430706 22717 PT1H3M31S hd true Friday 3811.0 0
4 Qtmrp_4BKRQ Kurzgesagt – In a Nutshell A Glimpse Into Our Future ✨ WORLDWIDE SHIPPING AVAILABLE ✨ The 12,024 Hu... None 2023-10-17 14:04:14+00:00 1481809 53767 1516 PT1M42S hd true Tuesday 102.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
193 F3QpgXBtDeo Kurzgesagt – In a Nutshell How The Stock Exchange Works (For Dummies) Why are there stocks at all?\n\nEveryday in th... [Stock market, wallstreet, stock exchange, mar... 2013-11-28 17:03:32+00:00 8297195 130534 7270 PT3M34S hd true Thursday 214.0 21
194 UuGrBhK2c7U Kurzgesagt – In a Nutshell The Gulf Stream Explained Learn about the role of the sea in global warm... [global conveyor belt, climate change, global ... 2013-10-11 19:11:39+00:00 5938614 64868 1984 PT5M4S hd true Friday 304.0 19
195 Uti2niW2BRA Kurzgesagt – In a Nutshell Fracking explained: opportunity or danger Fracking explained in five minutes.\n\nFrackin... [Fracking, controversy, oil, gas, nature, wate... 2013-09-03 09:12:24+00:00 7254944 101461 8073 PT5M3S hd true Tuesday 303.0 28
196 KsF_hdjWJjo Kurzgesagt – In a Nutshell The Solar System -- our home in space An Infographic trip through the wonders of the... [Solar system, planets, earth, mars, moon, jup... 2013-08-22 13:24:56+00:00 6231652 83643 6167 PT7M21S hd true Thursday 441.0 28
197 hOfRN0KihOU Kurzgesagt – In a Nutshell How Evolution works The mechanisms of evolution explained in one v... [evolution, science, darwin, design, infograph... 2013-07-11 14:09:52+00:00 10605269 244000 75674 PT11M48S hd true Thursday 708.0 17

198 rows × 15 columns

We can export this dataframe as its own csv file for ease of use in Jupyter Lab.

In [17]:
# Exports dataframe as csv file
video_df.to_csv('video_df.csv')

Let's also export the channel stats as a csv file.

In [18]:
# Exports dataframe as csv file
channel_stats.to_csv('channel_df.csv')

Data Visualization and Analysis¶

Let's create a barplot of the channel's 10 most-viewed videos.

In [30]:
# Format plot size for larger plots
plt.rcParams['figure.figsize'] = [10, 10]

# Creates barplot and sets fundamental parameters
ax = sns.barplot(
    x = 'viewCount', 
    y = 'title', 
    data = video_df.sort_values('viewCount', ascending=False)[0:9], 
    palette="crest", 
    hue="title"
)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Most Viewed Videos', fontsize='50', pad=30)
# y-axis values size
plt.rc('ytick', labelsize=20) 
# x-axis values size
plt.rc('xtick', labelsize=20)

Also the 10 least-viewed videos.

In [20]:
# Creates barplot and sets fundamental parameters
ax = sns.barplot(
    x = 'viewCount', 
    y = 'title', 
    data = video_df.sort_values('viewCount')[0:9], 
    palette = 'crest_r',
    hue = 'title'
)

# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000) + 'K'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Least Viewed Videos', fontsize='50', pad=30)
# y-axis values size
plt.rc('xtick', labelsize=20) 
# x-axis values size
plt.rc('xtick', labelsize=20)

We can see that 60% (6/10) of the least viewed videos are YouTube shorts, this analysis considers both full-length videos and shorts in the same category.

A violin plot can let us visualize the distribution of the views of all the channel's videos.

In [21]:
# Creates violin plot and sets x and y values
ax = sns.violinplot(
    x = video_df['viewCount'], 
    color = 'mediumseagreen', 
    cut = 0
)
# Set transparency (alpha)
plt.setp(ax.collections, alpha=.5)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Views Across Videos', fontsize='40', pad=30)
# y-axis values size
plt.rc('ytick', labelsize=20)

Let's also calculate the mean, median, max, and minimum values of the views.

In [22]:
# Creates dataframe with views values
views_data = [{"Mean" : format(video_df['viewCount'].mean()/1000000,".2f"),
               "Median" : format(video_df['viewCount'].median() / 1000000,".2f"),
               "Max" : format(video_df['viewCount'].max() / 1000000,".2f"),
               "Min" : format(video_df['viewCount'].min() / 1000000,".2f")
}]

pd.DataFrame(views_data).style.set_caption("Views in Millions")
Out[22]:
Views in Millions
  Mean Median Max Min
0 12.77 10.43 88.46 0.89

Now we'll visualize the relationship between views and likes as well as comments per video.

In [23]:
# Creates scatter plot and sets x and y values, as well as hue and size parameters
ax = sns.scatterplot(
    data = video_df, 
    x = 'viewCount', 
    y = 'likeCount', 
    hue = 'commentCount', 
    size = 'commentCount', 
    sizes = (10, 550), 
    palette = 'crest'
)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Divides y-axis values by 1,000 to convert likes to thousands
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda y, pos:'{:,.0f}'.format(y/1000) + ' K'))
# Title
plt.title('Views, Likes, and Comments', fontsize='40', pad=30)
# Labels
plt.xlabel('Views', fontsize='25', labelpad=20)
plt.ylabel('Likes', fontsize='25', labelpad=20)
# Legend
plt.legend(title='Comments', fontsize='15', title_fontsize='20')
Out[23]:
<matplotlib.legend.Legend at 0x15e5aa6d0>

From this scatterplot we can see that there is a positive correlation between the views a video has and its amount of likes and comments. This makes sense as videos with more views have a higher amount of viewers that can interact with it.

Two videos seem to have some interesting data:

  1. 'The Egg - A Short Story' has an outstanding number of comments and likes in regards to its views.
  2. 'The Coronavirus Explained & What You Should Do' has an outstanding number of views (it's this channels most viewed video) but not a proportional amount of likes or comments.

Next, let's create a histogram of the duration of all videos.

In [24]:
# Creates scatter plot and sets x and y values, as well as parameters
sns.histplot(
    data = video_df, 
    x = 'durationSecs',
    bins = 30,
    color = 'mediumseagreen',
    alpha = 0.5
)
# Labels
plt.xlabel('Duration (s)', fontsize='25', labelpad=20)
plt.ylabel('Count', fontsize='25', labelpad=20)
# Title
plt.title('Video Duration', fontsize='40', pad=30)
Out[24]:
Text(0.5, 1.0, 'Video Duration')

Let's implement some natural language processing (NLP) by creating a word cloud of the most common words in all video titles.

In [25]:
stop_words = set(stopwords.words('english'))
video_df['title_no_stopwords'] = video_df['title'].apply(lambda x: [item for item in str(x).split() if item not in stop_words])

all_words = list([a for b in video_df['title_no_stopwords'].tolist() for a in b])
all_words_str = ' '.join(all_words) 

wordcloud = WordCloud(width = 2000, height = 1000, random_state=1, background_color='white', 
                      colormap='crest', collocations=False).generate(all_words_str)
plot_cloud(wordcloud)

Lastly, we'll create a dataframe and barchart that visualizes the upload schedule of all videos based on days of the week.

In [26]:
# Creats dataframe based on the amount of videos uploaded per day
day_df = pd.DataFrame(video_df['pushblishDayName'].value_counts())
# Orders days of the week manually
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Redifines the dataframe index based on days of the week
day_df = day_df.reindex(weekdays)

day_df
Out[26]:
count
pushblishDayName
Monday 7.0
Tuesday 59.0
Wednesday 18.0
Thursday 56.0
Friday 6.0
Saturday NaN
Sunday 52.0
In [27]:
# Creats barplot and sets x and y values as well as color
ax = day_df.reindex().plot.bar(
    y = 'count', 
    color = "mediumseagreen", 
    alpha = 0.5, 
    edgecolor = 'black'
)
# Removes legend
ax.get_legend().remove()
# Labels
plt.xlabel('')
plt.ylabel('Count', fontsize='25', labelpad=20)
# Title
plt.title('Uploading Schedule', fontsize='40', pad=30)
Out[27]:
Text(0.5, 1.0, 'Uploading Schedule')

From this barchart it's clear that this channel uploads most of its videos on Tuesday, Thursday, and Sunday, with no videos uploaded on a Saturday.

Dashboard¶

Dataframes were be imported into PowerBI to create this dashboard.

YoutubeAPI Dashboard.png

Improvement Ideas¶

Some possible improvements to this project could be:

  • Extract, transform, and load data from multiple YouTube channels by adding a list of channel IDs.
  • Modify color scheme and improve overall design by using Figma.
  • Program a set interval for data to update automatically.

Source and Inspiration¶

This proyect is based on 'Youtube API for Python: How to Create a Unique Data Portfolio Project ' by Thu Vu data analytics on YouTube. Source: https://youtu.be/D56_Cx36oGY


Version Control¶

Nov 29 2023 -- Version 1.0